This is an R Markdown Notebook. It is the main r file including the related code for the project for the STA 206.
library(MASS)
library(ggplot2)
library(GGally)
We first read in the data and develope an exploratory data analysis.
col_names <- c('sex','length','diameter','height','weight.w','weight.s','weight.v','weight.sh','rings')
data <- read.table('../dataset/abalone.txt', header=FALSE, sep=',', col.names= col_names)
is.na(data$frame) #check missing value
logical(0)
summary(data)
sex length diameter height weight.w weight.s weight.v weight.sh rings
F:1307 Min. :0.075 Min. :0.0550 Min. :0.0000 Min. :0.0020 Min. :0.0010 Min. :0.0005 Min. :0.0015 Min. : 1.000
I:1342 1st Qu.:0.450 1st Qu.:0.3500 1st Qu.:0.1150 1st Qu.:0.4415 1st Qu.:0.1860 1st Qu.:0.0935 1st Qu.:0.1300 1st Qu.: 8.000
M:1528 Median :0.545 Median :0.4250 Median :0.1400 Median :0.7995 Median :0.3360 Median :0.1710 Median :0.2340 Median : 9.000
Mean :0.524 Mean :0.4079 Mean :0.1395 Mean :0.8287 Mean :0.3594 Mean :0.1806 Mean :0.2388 Mean : 9.934
3rd Qu.:0.615 3rd Qu.:0.4800 3rd Qu.:0.1650 3rd Qu.:1.1530 3rd Qu.:0.5020 3rd Qu.:0.2530 3rd Qu.:0.3290 3rd Qu.:11.000
Max. :0.815 Max. :0.6500 Max. :1.1300 Max. :2.8255 Max. :1.4880 Max. :0.7600 Max. :1.0050 Max. :29.000
No missing data observed.
Investigate qualitative and quantitative variables:
sapply(data, class)
sex length diameter height weight.w weight.s weight.v weight.sh rings
"factor" "numeric" "numeric" "numeric" "numeric" "numeric" "numeric" "numeric" "integer"
It’s observed that only sex is quantitative variable, we’ll consider indicator variable later.
It is interested to see the height data has minimum of zero. Aparrently, this is not possible. It might be the reason of the wrong input, but these data points should be excluded.
#d_data <- data[data$height == 0,] #get index of error data
#index_d_data <- as.numeric(rownames(d_data))
#data <- data[-(index_d_data)]
data <- subset(data, data$height != 0)
summary(data)
sex length diameter height weight.w weight.s weight.v weight.sh rings
F:1307 Min. :0.0750 Min. :0.0550 Min. :0.0100 Min. :0.0020 Min. :0.0010 Min. :0.0005 Min. :0.0015 Min. : 1.000
I:1340 1st Qu.:0.4500 1st Qu.:0.3500 1st Qu.:0.1150 1st Qu.:0.4422 1st Qu.:0.1862 1st Qu.:0.0935 1st Qu.:0.1300 1st Qu.: 8.000
M:1528 Median :0.5450 Median :0.4250 Median :0.1400 Median :0.8000 Median :0.3360 Median :0.1710 Median :0.2340 Median : 9.000
Mean :0.5241 Mean :0.4079 Mean :0.1396 Mean :0.8290 Mean :0.3595 Mean :0.1807 Mean :0.2388 Mean : 9.935
3rd Qu.:0.6150 3rd Qu.:0.4800 3rd Qu.:0.1650 3rd Qu.:1.1535 3rd Qu.:0.5020 3rd Qu.:0.2530 3rd Qu.:0.3287 3rd Qu.:11.000
Max. :0.8150 Max. :0.6500 Max. :1.1300 Max. :2.8255 Max. :1.4880 Max. :0.7600 Max. :1.0050 Max. :29.000
2 observed error data were deleted.
Also the predictor weight.w should be the linear function of variables weight.s, weight.v and weight.sh, which is weight.w = weight.s + weight.v+ weight.sh + weight loss during weighing operations
data$weight.loss <- data$weight.w - data$weight.s - data$weight.v - data$weight.sh
nrow(data[data$weight.loss < 0,]) # number of data with whole weight less than sum of other three weights
[1] 153
data <- subset(data, data$weight.loss >= 0)
summary(data)
sex length diameter height weight.w weight.s weight.v weight.sh rings
F:1280 Min. :0.1100 Min. :0.0900 Min. :0.0150 Min. :0.0080 Min. :0.0025 Min. :0.0005 Min. :0.0030 Min. : 2
I:1255 1st Qu.:0.4550 1st Qu.:0.3500 1st Qu.:0.1150 1st Qu.:0.4551 1st Qu.:0.1905 1st Qu.:0.0960 1st Qu.:0.1345 1st Qu.: 8
M:1487 Median :0.5450 Median :0.4250 Median :0.1450 Median :0.8117 Median :0.3412 Median :0.1725 Median :0.2360 Median :10
Mean :0.5272 Mean :0.4104 Mean :0.1405 Mean :0.8408 Mean :0.3632 Mean :0.1825 Mean :0.2415 Mean :10
3rd Qu.:0.6150 3rd Qu.:0.4800 3rd Qu.:0.1650 3rd Qu.:1.1625 3rd Qu.:0.5065 3rd Qu.:0.2554 3rd Qu.:0.3300 3rd Qu.:11
Max. :0.8150 Max. :0.6500 Max. :1.1300 Max. :2.8255 Max. :1.4880 Max. :0.7600 Max. :1.0050 Max. :29
weight.loss
Min. :0.00000
1st Qu.:0.01950
Median :0.03850
Mean :0.05358
3rd Qu.:0.06950
Max. :0.60800
153 data were removed.
The ggpairs plot below has same functionality as scatter plot, boxplots, and histogram plots
ggpairs(data, aes(colour = sex, alpha = 0.8), title="Pairs plot for abalone dataset") + theme_grey(base_size = 8)
Pan plots of sex (need pie charts for all predictors?)
sex_value <- c(1307, 1342, 1528)
sex <- c("F", "I", "M")
piepercent <- round(100*sex_value/sum(sex_value), 1)
pieplot <- pie(sex_value, labels = piepercent, main="Abalone sex pie chart", col = rainbow(length(sex_value)))
legend("topright", sex, cex = 0.8, fill = rainbow(length(sex_value)))
Create a new variable N = M + F, change to binary indicator variable
data['sex'] <- ifelse(data$sex == 'I', 'I', 'N')
data$sex <- as.factor(data$sex)
summary(data)
sex length diameter height weight.w weight.s weight.v weight.sh rings
I:1255 Min. :0.1100 Min. :0.0900 Min. :0.0150 Min. :0.0080 Min. :0.0025 Min. :0.0005 Min. :0.0030 Min. : 2
N:2767 1st Qu.:0.4550 1st Qu.:0.3500 1st Qu.:0.1150 1st Qu.:0.4551 1st Qu.:0.1905 1st Qu.:0.0960 1st Qu.:0.1345 1st Qu.: 8
Median :0.5450 Median :0.4250 Median :0.1450 Median :0.8117 Median :0.3412 Median :0.1725 Median :0.2360 Median :10
Mean :0.5272 Mean :0.4104 Mean :0.1405 Mean :0.8408 Mean :0.3632 Mean :0.1825 Mean :0.2415 Mean :10
3rd Qu.:0.6150 3rd Qu.:0.4800 3rd Qu.:0.1650 3rd Qu.:1.1625 3rd Qu.:0.5065 3rd Qu.:0.2554 3rd Qu.:0.3300 3rd Qu.:11
Max. :0.8150 Max. :0.6500 Max. :1.1300 Max. :2.8255 Max. :1.4880 Max. :0.7600 Max. :1.0050 Max. :29
weight.loss
Min. :0.00000
1st Qu.:0.01950
Median :0.03850
Mean :0.05358
3rd Qu.:0.06950
Max. :0.60800
Data splitting (training~70%, validation~30%)
set.seed(40) #set seed for random number generator
n_data <- nrow(data)
indexes <- sample(1:n_data, size = 0.3*n_data)
data_validation <- data[indexes,]
data_train <- data[-indexes,]
Examine the similarity of training set and validation set
par(mfrow=c(3,3))
boxplot(data_train$length, data_validation$length, col = 'orange', main = 'length', names=c('train', 'validation'))
boxplot(data_train$diameter, data_validation$diameter, col = 'orange', main = 'diameter', names=c('train', 'validation'))
boxplot(data_train$height, data_validation$height, col = 'orange', main = 'height', names=c('train', 'validation'))
boxplot(data_train$weight.w, data_validation$weight.w, col = 'orange', main = 'whole weight', names=c('train', 'validation'))
boxplot(data_train$weight.s, data_validation$weight.s, col = 'orange', main = 'shucked weight', names=c('train', 'validation'))
boxplot(data_train$weight.v, data_validation$weight.v, col = 'orange', main = 'viscera weight', names=c('train', 'validation'))
boxplot(data_train$weight.sh, data_validation$weight.sh, col = 'orange', main = 'shell weight', names=c('train', 'validation'))
boxplot(data_train$rings, data_validation$rings, col = 'orange', main = 'rings', names=c('train', 'validation'))
The data distribution looks similar.
The first model: Additive Multiple Linear Regression Model